Author

Seunghun Lee

Published

December 1, 2023

Code
## Clear environment
rm(list = ls())

1 Introduction

The notebook is to execute exploratory data analysis for IPEDS data.

Data source: IPEDS

R code: Github

2 Preparations

Loading necessary libraries for data wrangling and visualization.

Code
libs <- c('openxlsx', "stringr",                        # read/write data
          'skimr', "PerformanceAnalytics", "corrplot",  # exploring data
          "dplyr", "stringr", "glue",                   # data manipulation
          'gt', 'ggplot2', "patchwork", "psych",        # data visualization
          "naniar", "mice",                             # missing data analysis
          "stats", "paran", "factoextra",               # feature selection
          "rsample", "glmnet")                          # predictive analysis
    

invisible(lapply(libs, library, character.only = TRUE))

Create helper functions to facilitate an analysis

Code
# negation of %in%
'%!in%' <- function(x,y)!('%in%'(x,y)) 

# prevent clashes with other packages
select <- dplyr::select 

# Sanity check custom function
sanity_check <- function(data, type, vars) {
  library(dplyr)
  enquo_cols <- enquos(vars)
  if(type == "missing"){
    # Check missing values
    n_missing <- data %>% summarise(n = sum(is.na({{vars}})))
    print(paste0("The number of missing value in", " ", deparse(substitute(vars)), " ", "is", " ", toString(n_missing)))
    data %>% filter(is.na({{vars}})) %>% return()
  } else if(type == "duplication"){
    # Check duplicated values
    n_dup <- data %>% group_by_at(enquo_cols) %>% tally() %>% filter(n > 1) %>% select({{vars}}) %>% with(nrow(.))
    print(paste0("The unique number of duplicated value", " is ", toString(n_dup)))
    dup_list <- data %>% group_by_at(enquo_cols) %>% tally() %>% filter(n > 1) %>% select({{vars}})
    data %>% inner_join(dup_list, by = names(select(., {{vars}}))) %>% return()
  } else if (type == "skim") {
    # Skim data
    skimr::skim(data) %>% return()
  }
}

Importing data for our analysis.

Code
ipeds <- read.xlsx("../data/IPEDS_data.xlsx") 

3 Overview: Structure and Data Content

The first thing I want to do is to look at the actual data in its raw form. It will show the types of features I am dealing with (numeric, categorical, string, etc.), as well as will reveal some characteristics of the dataset. This includes the process of missing data analysis.

3.1 Skimming data

I start by examining the data through the skimr tool. It allows us to see the data types involved and the shape of the data.

Code
skim(ipeds) %>% summary()
Data summary
Name ipeds
Number of rows 1534
Number of columns 145
_______________________
Column type frequency:
character 27
numeric 118
________________________
Group variables None

I find:

  • The dataset contains 145 columns and 1534 rows, which is relatively small in terms of sample size. The number of variables is considerably large considering the sample size.
  • There is ID.number column with appears to be an unique identifier.
  • There are 27 string columns in the dataset. Those consist of basic information of the institution and the features (e.g. degree offered, religious affiliation).
  • There are 118 numeric columns in the dataset. Some of columns are suppposed to be converted to the categorical variable considering their forms (e.g. ID.number)
  • Missing values can be observed from the data. This is something I need to keep in mind in future exploratory and modelling steps.

3.2 Removing redundant variables

Let’s select only “useful” information from the data

Code
ipeds %>%
  select(Religious.affiliation, Level.of.institution, Control.of.institution, Historically.Black.College.or.University, Tribal.college) %>%
  names() %>%
  data.frame() %>% rename("Categorical Variables" = ".") %>%
  gt()
Categorical Variables
Religious.affiliation
Level.of.institution
Control.of.institution
Historically.Black.College.or.University
Tribal.college

Categorical variables selection:

  • Religious.affiliation: to which religion is the school affiliated (e.g. Churches of Christ)
  • Level.of.institution: the level of degree being offered (e.g. Four or more years)
  • Copntrol.of.institution: indicates whether it is private or public school
  • Historically.Black.College.or.University: indicates whether it is historically black institution
  • Tribal.college: indicates whether the college is affiliated with a tribe
Code
ipeds %>% 
  select(ID.number, Enrolled.total, ACT.Composite.75th.percentile.score, contains("Percent.of.undergraduate.enrollment.that.are."), `Number.of.first-time.undergraduates.-.in-state`, `Graduation.rate.-.Bachelor.degree.within.4.years,.total`, Percent.of.freshmen.receiving.Pell.grants) %>%
  names() %>%
  data.frame() %>% rename("Numeric Variables" = ".") %>%
  gt()
Numeric Variables
ID.number
Enrolled.total
ACT.Composite.75th.percentile.score
Percent.of.undergraduate.enrollment.that.are.American.Indian.or.Alaska.Native
Percent.of.undergraduate.enrollment.that.are.Asian
Percent.of.undergraduate.enrollment.that.are.Black.or.African.American
Percent.of.undergraduate.enrollment.that.are.Hispanic/Latino
Percent.of.undergraduate.enrollment.that.are.Native.Hawaiian.or.Other.Pacific.Islander
Percent.of.undergraduate.enrollment.that.are.White
Percent.of.undergraduate.enrollment.that.are.two.or.more.races
Percent.of.undergraduate.enrollment.that.are.Race/ethnicity.unknown
Percent.of.undergraduate.enrollment.that.are.Nonresident.Alien
Percent.of.undergraduate.enrollment.that.are.Asian/Native.Hawaiian/Pacific.Islander
Percent.of.undergraduate.enrollment.that.are.women
Number.of.first-time.undergraduates.-.in-state
Graduation.rate.-.Bachelor.degree.within.4.years,.total
Percent.of.freshmen.receiving.Pell.grants

Numeric variables selection:

  • ID.number: the school unique identifier
  • Enrolled.total: the number of student who were actually enrolled
  • ACT.Composite.75th.percentile.score: 75th percentile composite ACT score
  • Percent.of.undergraduate.enrollment.that.are: variables indicate the percentage of ethnicity of undergraduate who enrolled
  • Number.of.first-time.undergraduates.-.in-state: indicates the number of in-state undergraduates who are included in the group of “first-time college entrants”
  • Graduation.rate.-.Bachelor.degree.within.4.years,.total: the percentage of undergraduates graduating within 4 years
  • Percent.of.freshmen.receiving.Pell.grants: the percentage of freshmen who received the Federal Pell Grant
Code
## Change the nambes of variables 
ipeds_sub <- ipeds %>% 
  select(id = ID.number,
         bach = `Offers.Bachelor's.degree`,
         enrolled_total = Enrolled.total, 
         act_75th_percentile = ACT.Composite.75th.percentile.score, 
         contains("Percent.of.undergraduate.enrollment.that.are."), 
         ftic_perc = `Percent.of.first-time.undergraduates.-.in-state`,
         bachelor_within_4years = `Graduation.rate.-.Bachelor.degree.within.4.years,.total`, 
         pell_grant_perc = Percent.of.freshmen.receiving.Pell.grants, 
         religion = Religious.affiliation, 
         edu_level = Level.of.institution, 
         sch_type = Control.of.institution, 
         black_inst = Historically.Black.College.or.University, 
         tribal_inst = Tribal.college) %>%
  rename("perc_ai_an" = Percent.of.undergraduate.enrollment.that.are.American.Indian.or.Alaska.Native,
         "perc_as" = Percent.of.undergraduate.enrollment.that.are.Asian,
         "perc_bl" = Percent.of.undergraduate.enrollment.that.are.Black.or.African.American,
         "perc_hi" = `Percent.of.undergraduate.enrollment.that.are.Hispanic/Latino`,
         "perc_nh_pi" = Percent.of.undergraduate.enrollment.that.are.Native.Hawaiian.or.Other.Pacific.Islander,
         "perc_wh" = Percent.of.undergraduate.enrollment.that.are.White,
         "perc_mix" = Percent.of.undergraduate.enrollment.that.are.two.or.more.races,
         "perc_un" = `Percent.of.undergraduate.enrollment.that.are.Race/ethnicity.unknown`,
         "perc_in" = Percent.of.undergraduate.enrollment.that.are.Nonresident.Alien,
         "perc_f" = Percent.of.undergraduate.enrollment.that.are.women
         ) %>%
  mutate(sch_type = ifelse(sch_type == "Public", 1, 0)) %>%
  select(-`Percent.of.undergraduate.enrollment.that.are.Asian/Native.Hawaiian/Pacific.Islander`) 

## Feature transformation
ipeds_sub <- ipeds_sub %>% 
  mutate(across(is.character, as.factor)) 

4 Missing Data Analysis

Let’s have a closer look at those missing values. How many are there in total in the dataset?

Code
miss_var_summary(ipeds_sub %>% filter(!is.na(enrolled_total), enrolled_total != 0) %>% filter(bach == "Yes")) %>% data.frame() %>% gt()
variable n_miss pct_miss
ftic_perc 517 37.5726744
act_75th_percentile 177 12.8633721
bachelor_within_4years 9 0.6540698
pell_grant_perc 3 0.2180233
id 0 0.0000000
bach 0 0.0000000
enrolled_total 0 0.0000000
perc_ai_an 0 0.0000000
perc_as 0 0.0000000
perc_bl 0 0.0000000
perc_hi 0 0.0000000
perc_nh_pi 0 0.0000000
perc_wh 0 0.0000000
perc_mix 0 0.0000000
perc_un 0 0.0000000
perc_in 0 0.0000000
perc_f 0 0.0000000
religion 0 0.0000000
edu_level 0 0.0000000
sch_type 0 0.0000000
black_inst 0 0.0000000
tribal_inst 0 0.0000000

There are various missing values in most of the columns. In this case, it is good to categorize them into the types of missing data.

I find:

  • pell_grant_perc: MCAR, hard to figure out the rationale of the missing data due to its small number (3 NAs) – will be removed from the dataset
  • bachelor_within_4years: MCAR, as other related columns (Bachelor_within_5years and Bachelor_within_6years) are also empty, it appears that these entries were not reported – will be removed from the dataset
  • act_75th_percentile: MAR, some applicants chose to take either the SAT or ACT for their college admission requirements. Assuming that the SAT score is positively correlated with the ACT score, multiple imputation will be conducted to fill in the blank.
  • ftic_perc: the ftic_perc column contains a significant number of missing values, accounting for 37.6% of the data. To address this issue, we should investigate potential methods for imputing these missing values. While we currently lack information regarding any potential issues during the data collection process, it might be beneficial to explore insights from the dataset, such as information available in other columns, in order to assist with imputation. Following columns are potential variables which might be related to the ftic_perc - any ethnic properties, school type, pell_grant_perc
Code
foo <- ipeds %>% 
  filter(!is.na(Enrolled.total)) %>%
  filter(!is.na(`Offers.Bachelor's.degree`)) %>%
  filter(!is.na(`Percent.of.first-time.undergraduates.-.in-state`)) %>%
  filter(!is.na(Percent.of.freshmen.receiving.Pell.grants)) %>%
  select(contains("Percent.of.undergraduate.enrollment.that.are."),
         Control.of.institution,
         Historically.Black.College.or.University,
         Tribal.college,
         Percent.of.freshmen.receiving.Pell.grants,
         `Percent.of.first-time.undergraduates.-.in-state`) %>% 
  mutate(Control.of.institution = ifelse(Control.of.institution == "Public", 1, 0),
         Historically.Black.College.or.University = ifelse(Historically.Black.College.or.University == "Yes", 1, 0),
         Tribal.college = ifelse(Tribal.college == "Yes", 1, 0)) %>% 
  rename("perc_ai_an" = Percent.of.undergraduate.enrollment.that.are.American.Indian.or.Alaska.Native,
       "perc_as" = Percent.of.undergraduate.enrollment.that.are.Asian,
       "perc_bl" = Percent.of.undergraduate.enrollment.that.are.Black.or.African.American,
       "perc_hi" = `Percent.of.undergraduate.enrollment.that.are.Hispanic/Latino`,
       "perc_nh_pi" = Percent.of.undergraduate.enrollment.that.are.Native.Hawaiian.or.Other.Pacific.Islander,
       "perc_wh" = Percent.of.undergraduate.enrollment.that.are.White,
       "perc_mix" = Percent.of.undergraduate.enrollment.that.are.two.or.more.races,
       "perc_un" = `Percent.of.undergraduate.enrollment.that.are.Race/ethnicity.unknown`,
       "perc_in" = Percent.of.undergraduate.enrollment.that.are.Nonresident.Alien,
       "perc_f" = Percent.of.undergraduate.enrollment.that.are.women,
       "bl_colllege" = Historically.Black.College.or.University,
       "tribal_college" = Tribal.college,
       "sch_type" = Control.of.institution,
       "pell" = Percent.of.freshmen.receiving.Pell.grants,
       "ftic" = `Percent.of.first-time.undergraduates.-.in-state`
       ) %>%
  select(-`Percent.of.undergraduate.enrollment.that.are.Asian/Native.Hawaiian/Pacific.Islander`) 
  
chart.Correlation(foo, histogram=TRUE)

As indicated in the correlation chart above, the top three variables associated with ftic are school_type (public or private), pell_grant_perc, and perc_in (non-resident alien). This findings align with my expectation for several reasons. First, non-resident aliens, who are international students, typically attend college in their home countries before attending college in the US. Second, students who can afford the tuition at multiple colleges are generally ineligible for the Pell Grant.

Code
foo %>%
  group_by(sch_type) %>%
  summarise(mean(ftic),
            median(ftic)) %>%
  data.frame() %>%
  gt()
sch_type mean.ftic. median.ftic.
0 56.04018 59.5
1 81.93154 87.0

The most interesting variables is ‘school_type.’ In this analysis, I have defined “1” as public and “0” as private school. Notably, the proportion of ftic students is significantly higher in public schools. Here, we may hypothesize that students who already possess a bachelor’s degree tend to opt for private colleges, which are perceived to offer higher educational quality. Please note that this assumption is for the purpose of explaining the result and may be open to debate.

Based on the finding and the assumption, I define the missing values in ftic column as MAR and will conduct the multiple imputation using school_type, pell_grant_perc, and perc_in.

4.1 Listwise deletion

Simply remove the cases with NAs in such columns - pell_grant_pec, bachelor_within_4years

Code
ipeds_sub_ld <- ipeds_sub %>% 
  filter(enrolled_total != 0,
         !is.na(enrolled_total)) %>% 
  filter(bach == "Yes") %>% 
  filter(!is.na(pell_grant_perc)) %>% # remove NA cases from pell_grant_perc
  filter(!is.na(bachelor_within_4years)) # remove NA cases from bachelor_within_4years

ipeds_sub_ld %>% select(pell_grant_perc, bachelor_within_4years) %>% miss_var_summary() %>% data.frame() %>% gt()
variable n_miss pct_miss
pell_grant_perc 0 0
bachelor_within_4years 0 0

4.2 Multiple Imputation

Let’s create sat_comp variable to combine all the SAT scores and conduct an univariate imputation with the combined sat score.

Imputation information is below:

Code
ipeds_sat <- ipeds %>% 
  select(ID.number, 
                 SAT.Critical.Reading.75th.percentile.score, 
                 SAT.Math.75th.percentile.score, 
                 SAT.Writing.75th.percentile.score) %>% 
  rename(id = ID.number, 
         sat_read = SAT.Critical.Reading.75th.percentile.score, 
         sat_math = SAT.Math.75th.percentile.score, 
         sat_writing = SAT.Writing.75th.percentile.score) %>%
  mutate(sat_comp = rowMeans(subset(., select = c(sat_read, sat_math, sat_writing)), na.rm = TRUE)) 

ipeds_sub_ld_sat <- ipeds_sub_ld %>%
  left_join(ipeds_sat, by = "id") 

ipeds_sub_ld_sat_cleaned <- ipeds_sub_ld_sat %>% 
  mutate(drop = ifelse(is.nan(sat_comp), 1, 0)) %>%
  filter(drop == 0)

imp <- mice(ipeds_sub_ld_sat_cleaned %>% select(act_75th_percentile, sat_comp), 
            m = 10,
            maxit = 10,
            defaultMethod = "pmm",
            printFlag = FALSE)

print(imp)
Class: mids
Number of multiple imputations:  10 
Imputation methods:
act_75th_percentile            sat_comp 
              "pmm"                  "" 
PredictorMatrix:
                    act_75th_percentile sat_comp
act_75th_percentile                   0        1
sat_comp                              1        0
Code
ipeds_sub_ld_sat_cleaned_act <- ipeds_sub_ld_sat_cleaned %>%
  select(-c(act_75th_percentile, sat_comp)) %>%
  cbind(complete(imp))

Conduct multiple imputation with pecc_in, pell_grant_perc, and sch_type.

Imputation information is below:

Code
foo <- ipeds_sub_ld_sat_cleaned_act %>% select(perc_in, pell_grant_perc, sch_type, ftic_perc)

imp <- mice(foo, 
     m = 10,
     maxit = 10,
     defaultMethod = "pmm",
     printFlag = FALSE)

print(imp)
Class: mids
Number of multiple imputations:  10 
Imputation methods:
        perc_in pell_grant_perc        sch_type       ftic_perc 
             ""              ""              ""           "pmm" 
PredictorMatrix:
                perc_in pell_grant_perc sch_type ftic_perc
perc_in               0               1        1         1
pell_grant_perc       1               0        1         1
sch_type              1               1        0         1
ftic_perc             1               1        1         0
Code
ipeds_sub_ld_sat_cleaned_act_ftic <- ipeds_sub_ld_sat_cleaned_act %>%
  select(-c(ftic_perc)) %>%
  cbind(complete(imp) %>% select(ftic_perc)) 

4.3 Check

Let’s clean the data and check the missing values again.

Code
ipeds_sub_ld_sat_cleaned_act_ftic2 <- ipeds_sub_ld_sat_cleaned_act_ftic %>% 
  select(-c(bach, drop, religion, edu_level, tribal_inst,sat_read, sat_math, sat_writing)) %>%
  mutate(sch_type_desc = ifelse(sch_type == 1, "public", "private")) %>%
  mutate(black_inst_d = ifelse(black_inst == "Yes", 1, 0))

miss_var_summary(ipeds_sub_ld_sat_cleaned_act_ftic2) %>% data.frame() %>% gt()
variable n_miss pct_miss
id 0 0
enrolled_total 0 0
perc_ai_an 0 0
perc_as 0 0
perc_bl 0 0
perc_hi 0 0
perc_nh_pi 0 0
perc_wh 0 0
perc_mix 0 0
perc_un 0 0
perc_in 0 0
perc_f 0 0
bachelor_within_4years 0 0
pell_grant_perc 0 0
sch_type 0 0
black_inst 0 0
act_75th_percentile 0 0
sat_comp 0 0
ftic_perc 0 0
sch_type_desc 0 0
black_inst_d 0 0
Code
ipeds_imp <- ipeds_sub_ld_sat_cleaned_act_ftic2 

No more missing value. Good!

5 Visualization

Let’s take a look at the distribution of variables

5.1 Predictor Features

To assemble the plots in the comprehensive, dashboard-like overview we are using the patchwork package.

Code
p1 <- ipeds_imp %>% 
  ggplot(aes(pell_grant_perc)) +
  geom_density(fill = "blue") +
  theme_minimal() +
  labs(x = "", title = "Pell Grant Percentage")

p2 <- ipeds_imp %>% 
  count(sch_type_desc) %>%
  ggplot(aes(sch_type_desc, n, fill = sch_type_desc)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "", title = "School Type")

p3 <- ipeds_imp %>% 
  count(black_inst) %>%
  ggplot(aes(black_inst, n, fill = black_inst)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "", title = "Black Institution") 

p4 <- ipeds_imp %>% 
  ggplot(aes(act_75th_percentile)) +
  geom_density(fill = "darkgreen") +
  theme_minimal() +
  labs(x = "", title = "ACT 75th Percentile")

p5 <- ipeds_imp %>% 
  ggplot(aes(ftic_perc)) +
  geom_density(fill = "violet") +
  theme_minimal() +
  labs(x = "", title = "FTIC Percentage")


design <- "
ABC
DEE
"

p1 + p2 + p3 + p4 + p5 +
  plot_layout(design = design) +
  plot_annotation(title = "Variable Distributions")

I find:

  • The distribution of pell_grant_percentage and act_7hth are nearly normally distributed (slightly right skewed).
  • school_type and black_institution are boolean features.
  • ftic_perc is severely left skewed. May need a transformation if we want to add it in our model.

5.2 Target: bachelor_within_4years

Code
ipeds_imp %>% 
  ggplot(aes(bachelor_within_4years)) +
  geom_density(fill = "blue") +
  theme_minimal() +
  labs(x = "", title = "Target: Bachelor within 4 Years")

We find:

  • There is a slight imbalance in the target distribution. Slightly right skewed.

5.3 Predictor Feature Interactions

How do the predictor features interact with each other?

5.4 Correlations

When plotting a correlation matrix, we need to make sure that what we’re feeding in doesn’t include any missing values. We can specify this via the parameter use = "complete.obs":

Code
ipeds_imp %>% 
  select(-c(id, enrolled_total, sat_comp, sch_type_desc)) %>% 
  mutate(black_inst_d = ifelse(black_inst == "Yes", 1, 0)) %>%
  relocate(bachelor_within_4years, .after = ftic_perc) %>% 
  select(-black_inst) %>%
  chart.Correlation(histogram=TRUE)

We find:

  • Multicollinearity is detected among several variables. Feature reduction process is necessary.

5.4.1 Categorical feature interactions

Code
ipeds_imp %>%
  select(bachelor_within_4years, sch_type_desc) %>%
  ggplot(aes(bachelor_within_4years, group = sch_type_desc, fill = sch_type_desc)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  theme(legend.position = "top")

We find:

  • The number of students who got their bachelor in 4 years are larger in private schools than in public school
Code
ipeds_imp %>%
  select(bachelor_within_4years, black_inst) %>%
  ggplot(aes(bachelor_within_4years, group = black_inst, fill = black_inst)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  theme(legend.position = "top")

We find:

  • The number of students who got their bachelor in 4 years are larger in non-black institution than in black institution

6 Feature Selection

Conducting Principal Component Analysis using the Horn’s Parallel Analysis.

6.1 Principal Component Analysis

Code
ipeds_imp_sub <- ipeds_imp %>% select(-c(id, enrolled_total, black_inst, sat_comp, sch_type_desc))
pca_model <- prcomp(ipeds_imp_sub, scale=TRUE)  
Code
summary(pca_model)
Importance of components:
                         PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     2.052 1.5129 1.3784 1.1634 1.10033 1.01100 0.94417
Proportion of Variance 0.263 0.1431 0.1187 0.0846 0.07567 0.06388 0.05572
Cumulative Proportion  0.263 0.4061 0.5249 0.6095 0.68512 0.74900 0.80472
                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     0.86231 0.81952 0.67956 0.63665 0.57580 0.43212 0.42625
Proportion of Variance 0.04647 0.04198 0.02886 0.02533 0.02072 0.01167 0.01136
Cumulative Proportion  0.85119 0.89317 0.92203 0.94736 0.96808 0.97975 0.99111
                          PC15    PC16
Standard deviation     0.37634 0.02485
Proportion of Variance 0.00885 0.00004
Cumulative Proportion  0.99996 1.00000
Code
get_eigenvalue(pca_model)
         eigenvalue variance.percent cumulative.variance.percent
Dim.1  4.2086890938     26.304306836                    26.30431
Dim.2  2.2888852662     14.305532914                    40.60984
Dim.3  1.8999705707     11.874816067                    52.48466
Dim.4  1.3536009035      8.460005647                    60.94466
Dim.5  1.2107185789      7.566991118                    68.51165
Dim.6  1.0221299396      6.388312123                    74.89996
Dim.7  0.8914564220      5.571602638                    80.47157
Dim.8  0.7435799882      4.647374926                    85.11894
Dim.9  0.6716187420      4.197617137                    89.31656
Dim.10 0.4618060529      2.886287831                    92.20285
Dim.11 0.4053266757      2.533291723                    94.73614
Dim.12 0.3315456400      2.072160250                    96.80830
Dim.13 0.1867280357      1.167050223                    97.97535
Dim.14 0.1816913853      1.135571158                    99.11092
Dim.15 0.1416349611      0.885218507                    99.99614
Dim.16 0.0006177444      0.003860902                   100.00000
Code
fviz_eig(pca_model, 
         choice = c("eigenvalue"),
         col.var="blue")

Based on the proportion of explained variance and the scree plot using eigenvalue, 4 components seem appropriate. However, recognizing the interpretation is subjective, an additional analysis is necessary to validate its plausibility.

6.2 Parallel Analysis

Code
paran(ipeds_imp_sub, graph = TRUE, cfa = FALSE, centile = 95)

Using eigendecomposition of correlation matrix.
Computing: 10%  20%  30%  40%  50%  60%  70%  80%  90%  100%


Results of Horn's Parallel Analysis for component retention
480 iterations, using the 95 centile estimate

-------------------------------------------------- 
Component   Adjusted    Unadjusted    Estimated 
            Eigenvalue  Eigenvalue    Bias 
-------------------------------------------------- 
1           3.963821    4.208689      0.244867
2           2.096940    2.288885      0.191944
3           1.745690    1.899970      0.154279
4           1.228469    1.353600      0.125131
5           1.114435    1.210718      0.096283
-------------------------------------------------- 

Adjusted eigenvalues > 1 indicate dimensions to retain.
(5 components retained)

According to Horn’s Parallel Analysis, 5 components can be selected.

6.3 Contributions of variables to each PCs

Code
# Contributions of variables to PC1
pc1 <- fviz_contrib(pca_model, choice = "var", axes = 1)
# Contributions of variables to PC2
pc2 <- fviz_contrib(pca_model, choice = "var", axes = 2)
# Contributions of variables to PC3
pc3 <- fviz_contrib(pca_model, choice = "var", axes = 3)
# Contributions of variables to PC4
pc4 <- fviz_contrib(pca_model, choice = "var", axes = 4)
# Contributions of variables to PC5
pc5 <- fviz_contrib(pca_model, choice = "var", axes = 5)

pc1 + pc2 + pc3 + pc4 + pc5 +
  plot_layout(ncol = 2) +
  plot_annotation(title = "Contributions of Variables")